Introduction to Open Data Science - Course Project


How are you feeling right now? What do you expect to learn? Where did you hear about the course?


Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1: How did it work as a “crash course” on modern R tools and using RStudio? Which were your favorite topics? Which topics were most difficult? Some other comments on the book and our new approach of getting started with R Markdown etc.?


My GitHub repository


## [1] "The date today is"
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## [1] "2023-11-13 14:17:31 EET"

The text does not continue here



Assignment 2: Data wrangling and analysis

Anne Tyvijärvi

The learning2014 data is based on the international survey of Approaches to Learning. Observations with an exam score of zero have been removed from the data. “deep”, “stra” and “surf” are combination variables (averages) of individual items measuring the same “dimension”. The data frame has 166 rows and 7 columns, and all variables are numeric except gender, which is of type “character”.
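As a rough illustration of how such combination variables can be built, the sketch below averages a few made-up item columns into one score per dimension and drops rows with zero exam points. The item names (d1, d2, s1, s2) and all values are hypothetical, not the actual survey columns.

```r
# Toy survey data; column names and values are invented for illustration
survey <- data.frame(
  d1 = c(4, 3, 5, 2),
  d2 = c(2, 3, 4, 4),
  s1 = c(3, 5, 1, 2),
  s2 = c(5, 5, 3, 4),
  points = c(25, 0, 18, 30)
)

# Average the items that measure the same dimension
survey$deep <- rowMeans(survey[, c("d1", "d2")])
survey$stra <- rowMeans(survey[, c("s1", "s2")])

# Keep only participants with non-zero exam points
survey_clean <- survey[survey$points > 0, ]

survey_clean[, c("deep", "stra", "points")]
```

Averaging (rather than summing) keeps each combined score on the same scale as the individual items, which makes the dimensions easier to compare.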


The ggpairs() plot matrix shows that the majority of participants identified as female and that most participants were 20-30 years old. There was a strong positive correlation between attitude and exam points (p < 0.001), and a negative correlation between deep learning (questions measuring how well one understands what is being studied) and attitude (p < 0.05), as well as between deep learning and learning strategy (making an effort to learn).
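A single pairwise correlation and its p-value can also be checked directly with cor.test(). The sketch below uses simulated data (the variable names mirror the survey, but the numbers are made up) and is only meant to show where a correlation coefficient and its p-value come from.

```r
set.seed(1)  # simulated data, not the learning2014 survey

# Two hypothetical variables with a built-in positive relationship
n <- 166
attitude <- rnorm(n, mean = 3, sd = 0.7)
points   <- 10 + 3.5 * attitude + rnorm(n, sd = 5)

# Pearson correlation with a test of H0: the true correlation is 0
ct <- cor.test(attitude, points)

ct$estimate  # correlation coefficient r
ct$p.value   # p-value for that correlation
```

With real data, the same call would take two columns of the data frame, for example cor.test(df$attitude, df$points), where df is whatever name the data was read into.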

  • Based on the above results, I chose attitude, deep (relating to the “depth” of learning) and stra (learning strategies) as explanatory variables for exam points, and tested these with linear regression:

  • Model fit: the residuals are not symmetrically distributed around 0, so the model fit is not necessarily good

  • The t-value for attitude is 6.203, which indicates a relationship between attitude and points. This is also shown in the Pr(>|t|) column of the coefficient table, where p < 0.001

  • The t-values for deep (-0.998) and stra (1.793) are much closer to zero, so there is likely no relationship between points and deep or stra (although p < 0.1 for stra, so learning strategies may tend to be related to exam points)

  • R2 for the model is 0.2097, so 20.97% of the variance of the response variable (points) could be explained by the predictor variables (attitude, deep and stra)

  • Based on the above, the variables deep and stra were removed from the next regression analysis (See below for results).

  • Model fit: the distribution of residuals is more symmetrical than when the other two variables were included in the model, suggesting a better fit

  • t-value for attitude is 6.214 and p < 0.001, indicating a strong relationship between exam points and attitude

  • R2 (multiple R squared) is 0.1906, so 19.06% of the variance in points was explained by attitude alone
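The t-values and R-squared values quoted in the list above can be reproduced by hand: a t-value is the coefficient estimate divided by its standard error (for attitude in the first model, 3.5254 / 0.5683 ≈ 6.203), and R-squared is one minus the ratio of the residual to the total sum of squares. The sketch below checks both on simulated data; the data and coefficients are made up and only stand in for the survey.

```r
set.seed(42)  # simulated data standing in for the survey

n <- 166
attitude <- rnorm(n, mean = 3, sd = 0.7)
points   <- 11 + 3.5 * attitude + rnorm(n, sd = 5)

fit   <- lm(points ~ attitude)
coefs <- summary(fit)$coefficients

# t value = estimate divided by its standard error
t_manual <- coefs["attitude", "Estimate"] / coefs["attitude", "Std. Error"]

# R-squared = 1 - residual sum of squares / total sum of squares
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((points - mean(points))^2)
r2_manual <- 1 - ss_res / ss_tot

# Both should agree with what summary() reports
all.equal(t_manual, coefs["attitude", "t value"])
all.equal(r2_manual, summary(fit)$r.squared)
```

Because R-squared never decreases when predictors are added, the adjusted R-squared printed by summary() is usually the fairer number for comparing a three-predictor model with a one-predictor model.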


Next, I tested whether the model meets the assumptions of linear regression (i.e., linearity, independence, homoscedasticity, normality, no multicollinearity and no endogeneity). The “Residuals vs. Fitted” plot can be used to detect non-linearity, unequal error variances, and outliers. In our data, observations 145, 56 and 35 seem to be outliers (but other than that the residuals are quite evenly distributed around 0). The “Q-Q Residuals” plot is used to check the normality of the residuals. Here you can also see the outliers of our data. The “Residuals vs. Leverage” plot is mainly used to spot influential observations, and it can also hint at unequal variance. With our data, the spread of standardized residuals increases as a function of leverage, which suggests heteroscedasticity. Based on these results I would either remove the outliers from the data or try data transformations (log10, square root, ...) to better meet the assumptions of the linear regression model.
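The quantities behind these diagnostic plots can also be inspected as numbers. The following is a minimal sketch on simulated data (not the survey data; the model and values are made up) that collects standardized residuals, leverage and Cook's distance and lists the observations with the largest residuals.

```r
set.seed(7)  # simulated data, not the learning2014 survey

n <- 166
attitude <- rnorm(n, mean = 3, sd = 0.7)
points   <- 11 + 3.5 * attitude + rnorm(n, sd = 5)
fit <- lm(points ~ attitude)

# Numeric counterparts of what the diagnostic plots display
diag_tbl <- data.frame(
  obs       = seq_len(n),
  std_resid = rstandard(fit),       # standardized residuals
  leverage  = hatvalues(fit),       # leverage of each observation
  cooks_d   = cooks.distance(fit)   # combined influence measure
)

# The three observations with the largest absolute standardized residuals,
# similar in spirit to the points that the diagnostic plots label
head(diag_tbl[order(-abs(diag_tbl$std_resid)), ], 3)
```

For a visual check of homoscedasticity specifically, plot(fit, which = 3) draws the “Scale-Location” plot, which is often used for that purpose alongside the three plots discussed above.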

library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr   1.1.3     ✔ stringr 1.5.0
## ✔ forcats 1.0.0     ✔ tibble  3.2.1
## ✔ purrr   1.0.2     ✔ tidyr   1.3.0
## ✔ readr   2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)

#Set the working directory and read the data into R

setwd("C:/Users/03114911/OneDrive - Valtion/Anne's PhD papers, results, plans etc/MBDP/Open data science/IODS-project")
learning2014_readback <- read_csv("Data/learning2014.csv", show_col_types = FALSE)

#Data structure and dimensions: these allow you to learn basic information of your data
str(learning2014_readback) # all data is numeric, except gender is "character"("M"/"F")
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(learning2014_readback) # the data has 166 rows and 7 columns
## [1] 166   7
#Plot the relationships between variables in learning2014_readback to see how the variables are related and if there are any significant relationships. Colour is mapped to gender; alpha = 0.1 controls the transparency of the plotted points and fills, it is not a significance level.
plot <- ggpairs(learning2014_readback, mapping = aes(col = gender, alpha = 0.1))
plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Multiple regression
model1 <- lm(points ~ attitude + deep + stra, data = learning2014_readback)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = learning2014_readback)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08
model2 <- lm(points ~ attitude, data = learning2014_readback)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014_readback)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# Testing whether the model meets the assumptions of a linear regression
par(mfrow = c(2,2))
plot(model2, which = c(1,2,5))

date()
## [1] "Mon Nov 13 14:17:42 2023"

Here we go again…


(more chapters to be added similarly as we proceed with the course!)